Frame

The client bank XYZ is running a direct marketing campaign. It wants to identify customers who would potentially be buying their new term deposit plan.

Acquire

Data is obtained from UCI Machine Learning repository. http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing

Data from direct marketing campaign (phone calls) of a Portuguese Bank is provided.

Attribute Information:

bank client data:

age (numeric)
job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
default: has credit in default? (categorical: 'no','yes','unknown')
housing: has housing loan? (categorical: 'no','yes','unknown')
loan: has personal loan? (categorical: 'no','yes','unknown')

contact: contact communication type (categorical: 'cellular','telephone')
month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

emp.var.rate: employment variation rate - quarterly indicator (numeric)
cons.price.idx: consumer price index - monthly indicator (numeric)
cons.conf.idx: consumer confidence index - monthly indicator (numeric)
euribor3m: euribor 3 month rate - daily indicator (numeric)
nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

y - has the client subscribed a term deposit? (binary: 'yes','no')

The given data is randomly divided into train and test for the purpose of this workshop. Build the model for train and use it to predict on test.

Explore



In [1]:

    
#Import the necessary libraries
import numpy as np
import pandas as pd



In [2]:

    
#Read the train and test data
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

Exercise 1

print the number of rows and columns of train and test



In [16]:









    



(35211, 17) (10000, 17)

Exercise 2

Print the first 10 rows of train



In [4]:









    Out[4]:






  
    
      
      age
      job
      marital
      education
      default
      balance
      housing
      loan
      contact
      day
      month
      duration
      campaign
      pdays
      previous
      poutcome
      deposit
    
  
  
    
      0
      58
      management
      married
      tertiary
      no
      2143
      yes
      no
      unknown
      5
      may
      261
      1
      -1
      0
      unknown
      no
    
    
      1
      44
      technician
      single
      secondary
      no
      29
      yes
      no
      unknown
      5
      may
      151
      1
      -1
      0
      unknown
      no
    
    
      2
      33
      entrepreneur
      married
      secondary
      no
      2
      yes
      yes
      unknown
      5
      may
      76
      1
      -1
      0
      unknown
      no
    
    
      3
      47
      blue-collar
      married
      unknown
      no
      1506
      yes
      no
      unknown
      5
      may
      92
      1
      -1
      0
      unknown
      no
    
    
      4
      33
      unknown
      single
      unknown
      no
      1
      no
      no
      unknown
      5
      may
      198
      1
      -1
      0
      unknown
      no

Exercise 3

Print the column types of train and test. Are they the same in both train and test?



In [5]:

    
#train









    Out[5]:





age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
deposit      object
dtype: object



In [6]:

    
#test









    Out[6]:





age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
deposit      object
dtype: object



In [7]:

    
#Are they the same?



In [64]:

    
#Combine train and test
frames = [train, test]
input = pd.concat(frames)



In [9]:

    
#Print first 10 records of input









    Out[9]:






  
    
      
      age
      job
      marital
      education
      default
      balance
      housing
      loan
      contact
      day
      month
      duration
      campaign
      pdays
      previous
      poutcome
      deposit
    
  
  
    
      0
      58
      management
      married
      tertiary
      no
      2143
      yes
      no
      unknown
      5
      may
      261
      1
      -1
      0
      unknown
      no
    
    
      1
      44
      technician
      single
      secondary
      no
      29
      yes
      no
      unknown
      5
      may
      151
      1
      -1
      0
      unknown
      no
    
    
      2
      33
      entrepreneur
      married
      secondary
      no
      2
      yes
      yes
      unknown
      5
      may
      76
      1
      -1
      0
      unknown
      no
    
    
      3
      47
      blue-collar
      married
      unknown
      no
      1506
      yes
      no
      unknown
      5
      may
      92
      1
      -1
      0
      unknown
      no
    
    
      4
      33
      unknown
      single
      unknown
      no
      1
      no
      no
      unknown
      5
      may
      198
      1
      -1
      0
      unknown
      no

Exercise 4

Find if any column has missing value There is a pd.isnull function. How to use that?



In [12]:









    Out[12]:





age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
deposit      0
dtype: int64



In [65]:

    
#Replace deposit with a numeric column
#First, set all labels to be 0
input.at[:, "depositLabel"] = 0
#Now, set depositLabel to 1 whenever deposit is yes
input.at[input.deposit=="yes", "depositLabel"] = 1



In [ ]:

Exercise 5

Find % of customers in the input dataset who have purchased the term deposit



In [72]:









    Out[72]:





11.698480458295547



In [75]:

    
#Create the labels 
labels = 
labels









    Out[75]:





0       0
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      0
21      0
22      0
23      0
24      0
25      0
26      0
27      0
28      0
29      0
       ..
9970    1
9971    1
9972    1
9973    1
9974    1
9975    1
9976    1
9977    1
9978    1
9979    1
9980    1
9981    1
9982    1
9983    1
9984    1
9985    1
9986    1
9987    1
9988    1
9989    1
9990    1
9991    1
9992    1
9993    1
9994    1
9995    1
9996    1
9997    1
9998    1
9999    1
Name: depositLabel, dtype: int64



In [83]:

    
#Drop the deposit column 
input.drop(["deposit", "depositLabel"], axis=1)

Exercise 6

Did it drop? If not, what has to be done?

Exercise 7

Print columnn names of input



In [ ]:



In [85]:

    
#Get list of columns that are continuous/integer
continuous_variables = input.dtypes[input.dtypes != "object"].index



In [86]:

    
continuous_variables









    Out[86]:





Index([u'age', u'balance', u'day', u'duration', u'campaign', u'pdays',
       u'previous'],
      dtype='object')



In [87]:

    
#Get list of columns that are categorical
categorical_variables = input.dtypes[input.dtypes=="object"].index



In [88]:

    
categorical_variables









    Out[88]:





Index([u'job', u'marital', u'education', u'default', u'housing', u'loan',
       u'contact', u'month', u'poutcome'],
      dtype='object')

Exercise 8

Create inputInteger and inputCategorical - two datasets - one having integer variables and another having categorical variables



In [89]:

    
inputInteger =



In [91]:

    
#print inputInteger
inputInteger.head()



In [93]:

    
inputCategorical =



In [94]:

    
#print inputCategorical
inputCategorical.head()









    Out[94]:






  
    
      
      job
      marital
      education
      default
      housing
      loan
      contact
      month
      poutcome
    
  
  
    
      0
      management
      married
      tertiary
      no
      yes
      no
      unknown
      may
      unknown
    
    
      1
      technician
      single
      secondary
      no
      yes
      no
      unknown
      may
      unknown
    
    
      2
      entrepreneur
      married
      secondary
      no
      yes
      yes
      unknown
      may
      unknown
    
    
      3
      blue-collar
      married
      unknown
      no
      yes
      no
      unknown
      may
      unknown
    
    
      4
      unknown
      single
      unknown
      no
      no
      no
      unknown
      may
      unknown



In [101]:

    
#Convert categorical variables into Labels using labelEncoder

inputCategorical = np.array(inputCategorical)

Exercise 9

Find length of categorical_variables



In [102]:









    Out[102]:





9



In [119]:

    
#Load the preprocessing module
from sklearn import preprocessing



In [103]:

    
for i in range(len(categorical_variables)):
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(inputCategorical[:,i]))
    inputCategorical[:, i] = lbl.transform(inputCategorical[:, i])



In [105]:

    
#print inputCategorical

Exercise 10

Convert inputInteger to numpy array



In [107]:

    
inputInteger = 
inputInteger









    Out[107]:





array([[  58, 2143,    5, ...,    1,   -1,    0],
       [  44,   29,    5, ...,    1,   -1,    0],
       [  33,    2,    5, ...,    1,   -1,    0],
       ..., 
       [  69,  247,   22, ...,    2,   -1,    0],
       [  48,    0,   28, ...,    2,   -1,    0],
       [  31,  131,   15, ...,    1,   -1,    0]])

Exercise 11

Now, create the inputUpdated array that has both inputInteger and inputCategorical concatenated

Hint Check function called vstack and hstack



In [ ]:



In [118]:

    
inputUpdated.shape









    Out[118]:





(45211, 16)

Train the model

Model 1: Decision Tree



In [125]:

    
from sklearn import tree
from sklearn.externals.six import StringIO
import pydot



In [126]:

    
bankModelDT = tree.DecisionTreeClassifier(max_depth=2)



In [127]:

    
bankModelDT.fit(inputUpdated[:train.shape[0],:], labels[:train.shape[0]])









    Out[127]:





DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')



In [128]:

    
dot_data = StringIO() 
tree.export_graphviz(bankModelDT, out_file=dot_data) 
graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
graph.write_pdf("bankDT.pdf")









    Out[128]:





True



In [129]:

    
#Check the pdf

Exercise 12

Now, change the max_depth = 6 and check the results.

Then, change the max_depth= None and check the results



In [ ]:



In [144]:

    
# Prediction
prediction_DT = bankModelDT.predict(inputUpdated[train.shape[0]:,:])



In [133]:

    
#Compute the error metrics



In [134]:

    
import sklearn.metrics



In [135]:

    
sklearn.metrics.auc(labels[train.shape[0]:], prediction_DT)









    Out[135]:





0.5



In [136]:

    
#What does that tell?



In [137]:

    
#What's the error AUC for the other Decision Tree Models

Exercise 13

Instead of predicting classes directly, predict the probability and check the auc



In [ ]:



In [142]:

    
sklearn.metrics.auc(labels[train.shape[0]:], prediction_DT[:,0])









    Out[142]:





0.54849867669154428

Accuracy Metrics

AUC
ROC
Misclassification Rate
Confusion Matrix
Precision & Recall

Confusion Matrix

Calculate True Positive Rate

TPR = TP / (TP+FN)

Calculate False Positive Rate

FPR = FP / (FP+TN)

Precision

Recall



In [147]:

    
#Precision and Recall



In [145]:

    
sklearn.metrics.precision_score(labels[train.shape[0]:], prediction_DT)









    Out[145]:





0.57177033492822971



In [146]:

    
sklearn.metrics.recall_score(labels[train.shape[0]:], prediction_DT)









    Out[146]:





0.20427350427350427

Ensemble Trees

src: http://www.slideshare.net/hustwj/scaling-up-machine-learning-the-tutorial-kdd-2011-part-iia-tree-ensembles

Random Forest

src: http://www.slideshare.net/0xdata/jan-vitek-distributedrandomforest522013



In [148]:

    
from sklearn.ensemble import RandomForestClassifier



In [157]:

    
bankModelRF = RandomForestClassifier(n_jobs=-1, oob_score=True)



In [158]:

    
bankModelRF.fit(inputUpdated[:train.shape[0],:], labels[:train.shape[0]])









    Out[158]:





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)



In [156]:

    
bankModelRF.oob_score_









    Out[156]:





0.89128397375820057

Exercise 14

Do the following

Predict on test
Find accuracy metrics: AUC, Precision, Recall
How does it compare against Decision Tree



In [ ]:

Gradient Boosting Machines